Transcribing Speech: Errors in Corpora and Experimental Settings

Author

  • Isabella Chiari
Abstract

Administrations, government bodies, and courts of law have always faced the problem of defining limits in transcription practices. Nowadays corpus linguistics and computational linguistics have focused their attention on spoken corpora as indispensable tools for descriptive linguistics, as well as for applied purposes (in speech technologies, such as text-to-speech and speech recognition, in dialogue systems, in natural language processing and information retrieval, etc.). But transcription is not just a meta-linguistic practice serving linguistic analysis; it is, at the same time, a linguistic act in itself, governed by its own strategies and tightly linked to other speech acts and linguistic practices (such as note-taking, listening to spoken language for different purposes, writing from dictation, etc.). Recent literature has often been centred on transcription system design, on reviewing and comparing different transcription systems (Chafe, 1995; Connell and Kowal, 1999; Cook, 1995; Derville, 1997; Edwards and Lampert, 1993; Lapadat, 2000; Leech, Myers and Thomas, 1995; Pallaud, 2003; Romero, O’Connell and Kowal, 2002), and on errors and inconsistencies in linguistic annotation. Furthermore, there is a long tradition of transcription in ethnographic studies (Powers, 2005; Vigouroux, 2007) and in conversation analysis (Ashmore and Reed, 2000). Transcription of speech is often driven by the understanding strategies of individual transcribers, leading to specific error typologies (Chiari, 2006a; Chiari, 2006b; Lindsay and O’Connell, 1995; Pallaud, 2002; Pallaud, 2003). How does the transcriber contribute to the reconstruction or mis-reproduction of the spoken text? Common errors derived from experimentally induced transcriptions and from spoken reference corpora of the Italian language are compared and analyzed quantitatively and qualitatively in order to observe their different patterns, relative frequencies, and motivations of occurrence.

Similar resources

Developing Partially-Transcribed Speech Corpus from Edited Transcriptions

Large-scale spontaneous speech corpora are a crucial resource for various domains of spoken language processing. However, the available corpora are usually limited because their construction is costly, especially the precise transcription of speech. On the other hand, loosely transcribed corpora like shorthand notes, meeting records and closed captions are more widely available than pre...

Automatic Transcription Verification of Broadcast News and Similar Speech Corpora

In the last few years, the focus in ASR research has shifted from the recognition of clean read speech (i.e. WSJ) to the more challenging task of transcribing found speech like broadcast news (Hub-4 task) and telephone conversations (Switchboard). Available training corpora tend to become larger and more erroneous than before, as transcribing found speech is more difficult. In this paper we pre...

Semantic and phonetic automatic reconstruction of medical dictations

Automatic speech recognition (ASR) has become a valuable tool in large document production environments like medical dictation. While manual post-processing is still needed for correcting speech recognition errors and for creating documents which adhere to various stylistic and formatting conventions, a large part of the document production process is carried out by the ASR system. For improvin...

Multipass algorithm for acquisition of salient acoustic morphemes

We are interested in spoken language understanding within the domain of automated telecommunication services. Our current methodology involves training statistical language models from large annotated corpora for recognition and understanding. Since transcribing large speech corpora is a resource-consuming task, we are motivated to exploit speech without transcriptions. In particular, we...

Discriminative rescoring based on minimization of word errors for transcribing broadcast news

This paper describes a novel method of rescoring that reflects tendencies of errors in word hypotheses in speech recognition for transcribing broadcast news, including ill-trained spontaneous speech. The proposed rescoring assigns penalties to sentence hypotheses according to the recognition error tendencies in the training lattices themselves using a set of weighting factors for feature functi...

Publication date: 2007